Are we alone in the Universe? This is one of the most profound questions that humankind has sought to answer since the beginning of recorded history. We can gain some insight into this mystery with the modern search for exoplanets. The underlying purpose of contemporary exoplanet programs is to discover habitable planets, especially around nearby stars, and find evidence of life elsewhere in the cosmos. Using Earth and its lifeforms as a template to assess habitability, we seek other celestial bodies with conditions similar to our own, i.e., with Earth-like surface gravities and temperatures where liquid water could exist. In order to meet these criteria, most habitable worlds must be a particular mass and a particular distance from their host star. We are interested in analyzing the data for confirmed exoplanets to asses our progress in finding a planet with similar physical characteristics to Earth.
For our investigation, we used data from the NASA Exoplanet Archive, which can be found at the following URL https://exoplanetarchive.ipac.caltech.edu. This archive is an online compilation, collation, and cross-correlation of astronomical data and information on exoplanets and their host stars. The data are vetted by a team of astronomers at the California Institute of Technology (Caltech) under contract with the National Aeronautics and Space Administration (NASA) under the Exoplanet Exploration Program. An extensive overview of the data, services and tools of the archive can be found in a published paper by Akeson, et al. (2013, PASP, 125, 989) in the Publications of the Astronomical Society of the Pacific (PASP). This publication can be found here.
We downloaded our dataset from the NASA Exoplanet Archive on 2018 July 5. The data consists of 354 columns and 3,748 rows of information on confirmed exoplanets and their host stars as well as information about their discovery. Discovery information includes the method used to detect the exoplanet, the locale of the observatories used for detection (i.e., whether ground-based, space-based, or a mixture of both observations were used for detection), and the year of discovery. Physical characteristics of exoplanets present in the archive include planetary mass, orbital period, and orbital semi-major axis. Physical properties of host stars listed in the archive include stellar mass, stellar radius, effective temperature, surface gravity, spectral type, luminosity, and distance from Earth. These physical properties for both exoplanets and their host stars are important for determining the similarity between an exoplanet and Earth, and their detection sensitivity.
We trimmed and removed space characters from the original dataset from the NASA Exoplanet Archive to the 15 columns that are most relevant and directly related to habitable planets. These data are stored in the file planets.csv. The variable for which we are most interested as a response for this study is pl_orbsmax, the planet orbital semi-major axis in astronomical units (\(\rm{AU} = 1.495978707 \times 10^{11}\,\rm{m}\)). This response variable is a key physical parameter that determines the habitability of a planet.
To ensure we had a consistent dataset with no empty values for any responses or predictors for which we were considering, we filtered out rows with any empty values and were left with 594 rows.
planets <- read.csv("planets.csv")
planets_good <- planets[complete.cases(planets), ]
The structure of the R object for which we will perform our analysis is as follows.
str(planets_good)
## 'data.frame': 594 obs. of 15 variables:
## $ pl_discmethod: Factor w/ 10 levels "Astrometry","EclipseTimingVariations",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ pl_disc : int 2007 2009 2008 2002 1996 2018 2010 2010 2009 2008 ...
## $ pl_locale : Factor w/ 3 levels "Ground","MultipleLocales",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ st_dist : num 110.6 119.5 76.4 18.1 21.4 ...
## $ st_optmag : num 4.74 5.02 5.23 6.61 6.25 ...
## $ st_teff : num 4742 4213 4813 5338 5750 ...
## $ st_mass : num 2.7 2.78 2.2 0.9 1.08 0.99 1.54 1.54 1.93 0.98 ...
## $ st_rad : num 19 29.79 11 0.93 1.13 ...
## $ st_logg : num 2.31 1.93 2.63 4.45 4.36 2.42 3.5 3.5 4.43 1.71 ...
## $ st_metfe : num -0.35 -0.02 -0.24 0.41 0.06 -0.77 -0.03 -0.03 0.19 -0.46 ...
## $ st_vsini : num 1.2 1.5 2.6 1.6 2.18 3.36 2.77 2.77 38.3 1 ...
## $ pl_orbper : num 326 516 186 1773 798 ...
## $ pl_orbsmax : num 1.29 1.53 0.83 2.93 1.66 ...
## $ pl_orbeccen : num 0.231 0.08 0 0.37 0.68 0.042 0.09 0.29 0.29 0.38 ...
## $ pl_bmasse : num 6166 4685 1526 1481 566 ...
There are two factor variables and 13 numeric variables. The following table briefly describes the variables.
| Variable name | Variable description |
|---|---|
pl_discmethod |
Method of discovery (RadialVelocity, Transit, TransitTimingVariations, etc.) |
pl_disc |
Year of discovery |
pl_locale |
Locale of discovery (Ground, MultipleLocales or Space) |
st_dist |
Stellar distance (parsecs) |
st_optmag |
Stellar apparent magnitude (mag) |
st_teff |
Stellar effective temperature (Kelvin) |
st_mass |
Stellar mass (solar masses) |
st_rad |
Stellar radius (solar radii) |
st_logg |
Stellar surface gravity (log g) |
st_metfe |
Stellar metallicity (log(Fe/H or M/H)) |
st_vsini |
Stellar projected rotational velocity (m/s) |
pl_orbper |
Planet orbital period (days) |
pl_orbsmax |
Planet orbital semi-major axis (AU) |
pl_orbeccen |
Planet orbital eccentricity |
pl_bmasse |
Planet mass (Earth masses) |
Some of these variables may be collinear to each other. We can inspect for collinearity visually using a matrix of scatterplots of the variables.
From the matrix of scatterplots, it appears that planet orbital period (pl_orbper) and orbital semi-major axis (pl_orbsmax) are collinear. This collinearity is not unexpected since orbital period and semi-major axis are related to each other physically as supported in Kepler’s third law of planetary motion. It also appears that stellar radius (st_rad) and stellar surface gravity (st_logg) are collinear as well. Collinearity makes estimating model coefficients and interpreting models more difficult, but it does not affect model predictions. We will keep this in mind as we search for potential models for our data.
The response for which we are most interested for the study exoplanet habitability is planet orbital semi-major axis (pl_orbsmax). The planet orbital semi-major axis is essentially the distance between a planet and its host star. This distance is important to planet habitability because only a certain range of distances between a planet and its host star would allow liquid water to exist on the planet to support life as we know it.
We will first begin with an additive model with pl_orbsmax as the response and the remaining variables as predictors except pl_orbper which is fairly collinear with the response.
pl_orbsmax_mod_add <- lm(pl_orbsmax ~ . - pl_orbper, data = planets_good)
summary(pl_orbsmax_mod_add)
##
## Call:
## lm(formula = pl_orbsmax ~ . - pl_orbper, data = planets_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.704 -0.721 -0.144 0.387 9.780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.39e+01 2.44e+01 -3.85 0.00013 ***
## pl_discmethodTransit -1.10e+00 2.37e-01 -4.63 4.6e-06 ***
## pl_discmethodTransitTimingVariations -6.67e-01 9.15e-01 -0.73 0.46665
## pl_disc 4.72e-02 1.21e-02 3.91 0.00010 ***
## pl_localeSpace 2.68e-01 1.92e-01 1.40 0.16342
## st_dist 1.67e-05 2.24e-04 0.07 0.94062
## st_optmag -1.01e-01 4.49e-02 -2.25 0.02513 *
## st_teff 2.83e-04 1.27e-04 2.23 0.02587 *
## st_mass 6.09e-02 1.06e-01 0.57 0.56665
## st_rad -3.03e-02 1.48e-02 -2.04 0.04159 *
## st_logg -1.05e-01 1.53e-01 -0.69 0.49154
## st_metfe 2.05e-01 2.53e-01 0.81 0.41811
## st_vsini -2.24e-02 7.88e-03 -2.85 0.00453 **
## pl_orbeccen 1.09e+00 2.81e-01 3.90 0.00011 ***
## pl_bmasse 2.50e-04 5.09e-05 4.90 1.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.23 on 579 degrees of freedom
## Multiple R-squared: 0.32, Adjusted R-squared: 0.303
## F-statistic: 19.4 on 14 and 579 DF, p-value: <2e-16
From the high p-values from the t-statistic for some of the predictors, a simpler model may be preferred.
We will use a backward BIC search to find a potentially better model.
n <- length(resid(pl_orbsmax_mod_add))
pl_orbsmax_mod_back_bic <- step(pl_orbsmax_mod_add, direction = "backward", k = log(n), trace = 0)
summary(pl_orbsmax_mod_back_bic)
##
## Call:
## lm(formula = pl_orbsmax ~ pl_discmethod + pl_disc + st_teff +
## st_vsini + pl_orbeccen + pl_bmasse, data = planets_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.317 -0.779 -0.151 0.355 9.978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.42e+01 2.30e+01 -3.65 0.00028 ***
## pl_discmethodTransit -1.51e+00 1.33e-01 -11.33 < 2e-16 ***
## pl_discmethodTransitTimingVariations -9.08e-01 8.83e-01 -1.03 0.30377
## pl_disc 4.14e-02 1.14e-02 3.64 0.00030 ***
## st_teff 4.02e-04 1.02e-04 3.93 9.6e-05 ***
## st_vsini -2.05e-02 7.69e-03 -2.67 0.00776 **
## pl_orbeccen 1.13e+00 2.76e-01 4.10 4.7e-05 ***
## pl_bmasse 1.99e-04 4.12e-05 4.82 1.9e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.24 on 586 degrees of freedom
## Multiple R-squared: 0.304, Adjusted R-squared: 0.296
## F-statistic: 36.6 on 7 and 586 DF, p-value: <2e-16
The model found by a backward BIC search is considerably simpler; this model has seven parameters compared to 14 for the additive model. Furthermore, most of the predictors have low p-values associated with their t-statistic. However, the adjusted \(R^2\) of 0.2957 could be improved.
Next, we will try transform the response pl_orbsmax. Physically, the span of exoplanet orbital semi-major axis is fairly large. Thus, we are motivated to transform the response with a logarithm function and repeat the previous model search.
log_pl_orbsmax_mod_add <- lm(log(pl_orbsmax) ~ . - pl_orbper, data = planets_good)
n <- length(resid(log_pl_orbsmax_mod_add))
log_pl_orbsmax_mod_back_bic <- step(log_pl_orbsmax_mod_add, direction = "backward", k = log(n), trace = 0)
summary(log_pl_orbsmax_mod_back_bic)
##
## Call:
## lm(formula = log(pl_orbsmax) ~ pl_discmethod + pl_locale + st_teff +
## st_rad + st_logg + st_vsini + pl_orbeccen + pl_bmasse, data = planets_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9378 -0.6700 0.0511 0.7175 2.8317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4821714 0.6532614 0.74 0.46075
## pl_discmethodTransit -2.4535970 0.1308455 -18.75 < 2e-16 ***
## pl_discmethodTransitTimingVariations -1.0410831 0.8001113 -1.30 0.19371
## pl_localeSpace 0.8005917 0.1595787 5.02 7.0e-07 ***
## st_teff 0.0003240 0.0000997 3.25 0.00122 **
## st_rad -0.0305686 0.0117936 -2.59 0.00978 **
## st_logg -0.7139814 0.1163375 -6.14 1.6e-09 ***
## st_vsini -0.0176393 0.0069216 -2.55 0.01108 *
## pl_orbeccen 1.7297954 0.2480178 6.97 8.3e-12 ***
## pl_bmasse 0.0001702 0.0000437 3.89 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.1 on 584 degrees of freedom
## Multiple R-squared: 0.595, Adjusted R-squared: 0.589
## F-statistic: 95.5 on 9 and 584 DF, p-value: <2e-16
For this log(pl_orbsmax) model, the adjusted \(R^2\) of 0.5892 is an improvement from the analogous model for pl_orbsmax. We could consider interaction models, but with nine parameters in this latest model already, we will forego deriving an interaction model for now.
We will check the constant variance assumption for our latest model using a residual versus fitted values plot, and using a Breusch-Pagan test.
By visual inspection, the general span of residuals is not very constant across the range of fitted values. This suggests that homoscedasticity or constant variance is suspect.
Next, we will perform the Breusch-Pagan test on the model.
library(lmtest)
log_pl_orbsmax_mod_back_bic_bp <- bptest(log_pl_orbsmax_mod_back_bic)
The p-value from the Breusch-Pagan test for the model is \(1.6911\times 10^{-25}\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(1.6911\times 10^{-25} \ll \alpha\). This further suggests that homoscedasticity is suspect.
We will check the normality assumption for our model using a Q-Q plot, and using a Shapiro-Wilk test.
By visual inspection, the data seems to deviate slightly from normality in the lower part of the Q-Q plot. The normality assumption may be suspect.
Next, we will perform the Shapiro-Wilk test on the model.
log_pl_orbsmax_mod_back_bic_sw <- shapiro.test(resid(log_pl_orbsmax_mod_back_bic))
The p-value from the Shapiro-Wilk test for the model is \(0.0024\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.0024 \lt \alpha\). This further suggests that normality is suspect.
While we could continue to tweak the model with transformations on the predictors or vary model search approaches like direction (e.g., forward or stepwise) or quality estimator (e.g., AIC) to obtain homoscedasticity and normality, we can also try an exhaustive model search. An exhaustive search that is not too large could allow us to systematically try mutually exclusive combinations of linear, logarithmic and exponential transformations of the predictors. For our predictors, there would be 209,952 combinations of models.
We will first setup the storage for the results of the exhaustive model search.
n <- (2^5) * (3^8)
loocv_rmse_results <- rep(0, n)
adj_r2_results <- rep(0, n)
bp_value_results <- rep(0, n)
sw_value_results <- rep(0, n)
num_params_results <- rep(0, n)
We will generate strings that will be parsed and evaluated by R to fit the models. See the appendix for the R code.
We will fit all of the model combinations and store the results.
for (i in 2:n) {
eval(parse(text = paste(lm_str_head, lm_str[i], lm_str_tail)))
adj_r2_results[i] <- get_adj_r2(cur_mod)
bp_value_results[i] <- bptest(cur_mod)$p.value
loocv_rmse_results[i] <- get_loocv_rmse(cur_mod)
num_params_results[i] <- get_num_params(cur_mod)
sw_value_results[i] <- shapiro.test(resid(cur_mod))$p.value
}
We would like to identify models in our collection where we fail to reject the hypothesis of homoscedasticity and normality, and have a reasonably large adjusted \(R^2\) values, e.g., \(\gt 0.5\).
idx <- (bp_value_results > 0.01) & (sw_value_results > 0.01) & (adj_r2_results > 0.5)
There are 161 models where we fail to reject the hypothesis of homoscedasticity and normality for a significance level of \(\alpha = 0.01\).
Of those models, we could choose the model with the lowest leave-one-out cross-validation root mean square (LOOCV-RMSE) value, which could be used to check for overfitting. Overfitting can be implied by an increase in LOOCV-RMSE with an increase in model complexity.
eval(parse(text = paste("log_pl_orbsmax_mod_exh = lm(log(pl_orbsmax)~", lm_str[which(idx)[which.min(loocv_rmse_results[which(idx)])]], lm_str_tail)))
We will use a backward BIC search to see if we could find a simpler model with this model as a starting point.
n <- length(resid(log_pl_orbsmax_mod_exh))
log_pl_orbsmax_mod_exh_back_bic <- step(log_pl_orbsmax_mod_exh, direction = "backward", k = log(n), trace = 0)
We will use this model to gain some insight into the orbital semi-major axes of confirmed exoplanets to date.
The R summary for the model we have chosen to examine the orbital semi-major axis of exoplanets is given below.
summary(log_pl_orbsmax_mod_exh_back_bic)
##
## Call:
## lm(formula = log(pl_orbsmax) ~ pl_locale + st_optmag + st_vsini +
## log(pl_bmasse), data = planets_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.832 -0.826 -0.003 0.832 3.166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.32730 0.25613 1.28 0.2
## pl_localeSpace 1.11819 0.17278 6.47 2.0e-10 ***
## st_optmag -0.38017 0.02037 -18.67 < 2e-16 ***
## st_vsini -0.04028 0.00611 -6.60 9.4e-11 ***
## log(pl_bmasse) 0.38803 0.02919 13.29 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.18 on 589 degrees of freedom
## Multiple R-squared: 0.534, Adjusted R-squared: 0.53
## F-statistic: 168 on 4 and 589 DF, p-value: <2e-16
We will check the constant variance assumption for our final model using a residual versus fitted values plot, and using a Breusch-Pagan test.
By visual inspection, the general span of residuals is mostly constant across the range of fitted values although many of the datapoints are clustered in two areas.
Next, we will perform the Breusch-Pagan test on the model.
library(lmtest)
log_pl_orbsmax_mod_exh_back_bic_bp <- bptest(log_pl_orbsmax_mod_exh_back_bic)
The p-value from the Breusch-Pagan test for the model is \(0.3016\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.3016 \gt \alpha\). This suggests that homoscedasticity is not suspect.
We will check the normality assumption for our model using a Q-Q plot, and using a Shapiro-Wilk test.
By visual inspection, the data seems to consistent with normality.
Next, we will perform the Shapiro-Wilk test on the model.
log_pl_orbsmax_mod_exh_back_bic_sw <- shapiro.test(resid(log_pl_orbsmax_mod_exh_back_bic))
The p-value from the Shapiro-Wilk test for the model is \(0.3725\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.3725 \gt \alpha\). This suggests that normality is not suspect.
We will see how influential points affect the coefficients. Here, we define influential points as those with a Cook’s distance larger than four divided by the number of points. We will remove the influential points and refit the model.
cd_log_pl_orbsmax_mod_exh_back_bic = cooks.distance(log_pl_orbsmax_mod_exh_back_bic)
log_pl_orbsmax_mod_exh_back_bic_fix = lm(log_pl_orbsmax_mod_exh_back_bic, data = planets_good, subset = cd_log_pl_orbsmax_mod_exh_back_bic <= 4 / length(cd_log_pl_orbsmax_mod_exh_back_bic))
summary(log_pl_orbsmax_mod_exh_back_bic_fix)
##
## Call:
## lm(formula = log_pl_orbsmax_mod_exh_back_bic, data = planets_good,
## subset = cd_log_pl_orbsmax_mod_exh_back_bic <= 4/length(cd_log_pl_orbsmax_mod_exh_back_bic))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0477 -0.7670 0.0016 0.7817 2.6925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6205 0.2512 2.47 0.014 *
## pl_localeSpace 1.1508 0.1708 6.74 3.9e-11 ***
## st_optmag -0.4124 0.0197 -20.93 < 2e-16 ***
## st_vsini -0.0565 0.0108 -5.22 2.5e-07 ***
## log(pl_bmasse) 0.3971 0.0286 13.86 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.1 on 566 degrees of freedom
## Multiple R-squared: 0.591, Adjusted R-squared: 0.588
## F-statistic: 204 on 4 and 566 DF, p-value: <2e-16
Overall, the coefficients of the model fit with the influential points removed is similar to those of the model fit with the influential points included. The adjusted \(R^2\) for the model without influential points is somewhat higher with a value of 0.588.
Doing a Breusch-Pagan test on the model fit without influential points, we see that the p-value from the test is \(0.0037\). For an often used significance level of \(\alpha = 0.01\), this p-value of \(0.0037 \lt \alpha\). This suggests that homoscedasticity is suspect. Therefore, we will continue to use the model derived from data with the influential points.
Next, we will look at the variance inflation factor (VIF) to examine collinearity in our model variables. The following table lists the predictors in our model and its corresponding VIF.
| Variable name | VIF |
|---|---|
pl_locale |
1.597 |
st_optmag |
1.539 |
st_vsini |
1.035 |
log(pl_bmasse) |
1.104 |
None of the model variables has a VIF greater than five, so we do not suspect any significant collinearity in our model.
| Model | Adjusted \(R^2\) | LOOCV RMSE | Breusch-Pagan test decision | Shapiro-Wilks test decision | # of Predictors |
|---|---|---|---|---|---|
| Additive Model | 0.3031 | 1.259 | Reject | Reject | 14 |
| Backward BIC Model | 0.2957 | 1.252 | Reject | Reject | 7 |
| Logarithmic Additive Model | 0.5940 | 1.117 | Reject | Reject | 14 |
| Logarithmic Backward BIC Model | 0.5892 | 1.112 | Reject | Reject | 9 |
| Logarithmic Exhaustive Model | 0.5387 | 1.177 | Fail to Reject | Fail to Reject | 8 |
| Logarithmic Exhaustive Backward BIC Model | 0.5304 | 1.184 | Fail to Reject | Fail to Reject | 4 |
| Logarithmic Exhaustive Backward BIC Model without Influential Points | 0.5880 | 1.102 | Reject | Fail to Reject | 4 |
We prefer models where assumptions of homoscedasticity and normality are not suspect. While the logarithmic additive and logarithmic backward BIC models had larger adjusted \(R^2\) values and lower LOOCV-RMSE values, the Breusch-Pagan and Shapiro-Wilks tests results would reject the hypothesis of constant variance and normality, respectively. The logarithmic exhaustive backward BIC model has comparable adjusted \(R^2\) and LOOCV-RMSE values and with only four predictors compared to eight predictors for its counterpart without applying a backward BIC search. Finally, we would choose the logarithmic exhaustive backward BIC model with influential points over its counterpart without influential points because the assumption of constant variance is suspect for the latter.
The R summary for the model we have chosen to examine the orbital semi-major axis of exoplanets is given below.
summary(log_pl_orbsmax_mod_exh_back_bic)
##
## Call:
## lm(formula = log(pl_orbsmax) ~ pl_locale + st_optmag + st_vsini +
## log(pl_bmasse), data = planets_good)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.832 -0.826 -0.003 0.832 3.166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.32730 0.25613 1.28 0.2
## pl_localeSpace 1.11819 0.17278 6.47 2.0e-10 ***
## st_optmag -0.38017 0.02037 -18.67 < 2e-16 ***
## st_vsini -0.04028 0.00611 -6.60 9.4e-11 ***
## log(pl_bmasse) 0.38803 0.02919 13.29 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.18 on 589 degrees of freedom
## Multiple R-squared: 0.534, Adjusted R-squared: 0.53
## F-statistic: 168 on 4 and 589 DF, p-value: <2e-16
The model we have chosen to predict the (logarithm of) planet orbital semi-major axis is an additive model with four predictors which are the locale of discovery, the stellar apparent magnitude, the stellar projected rotational velocity, and the logarithm of planet mass.
The coefficient for the locale of discovery is 1.1182. This means that exoplanets discovered using space-based observations are generally found to be further from their host stars than those discovered using ground-based observations by a factor of 3.0593. Most space-based exoplanet discoveries were made using the Kepler space observatory (https://exoplanetarchive.ipac.caltech.edu/docs/counts_detail.html). Kepler uses the transit search method to find exoplanets which is theoretically more sensitive to planets with wide orbits than is the radial velocity method which is the other major method for planet detection.
The coefficient for stellar apparent magnitude is -0.3802. This indicates that exoplanets with tighter orbits tend to be found around stars with higher apparent magnitude, and vice versa, i.e., exoplanets with wider orbits tend to be found around stars with lower apparent magnitude. The apparent magnitude of a star is a measure of its brightness as seen by an observer on Earth. Stellar magnitude is an inverse relation, i.e., the brighter an object appears, the lower its magnitude value and vice versa. Therefore, exoplanets with small orbits are more often found around faint stars as observed from Earth. This is not unexpected since the signal from faint stars has higher noise than from bright stars, and planets close-in to their host star are intrinsically more likely to be detected by the two most common techniques of planet detection, radial velocity method and transit search, than those farther from their host star.
The coefficient for stellar projected rotational velocity is -0.0403. For every unit increase in projected rotational velocity, there is a slight decrease in the orbital semi-major axis of a confirmed exoplanet. In other words, one is more likely to find an exoplanet in a tight orbit around a rapidly rotating star, and in a wide orbit around a slowly rotating star. This effect could come from observational bias. The radial velocity method for finding exoplanets relies on tracking spectral features from starlight, and stellar rotation broadens these features and making them less trackable. As mentioned previously, the radial velocity method is also more sensitive to planets in tighter orbits, so the radial velocity indicators from planets orbiting tightly around a rapidly rotating star are more likely to accurately observed than wider orbiting counterparts.
The coefficient of (the logarithm of) planet mass is 0.388. This means that more massive planets tend to be found orbiting farther from their host star than their less massive counterparts. This planet mass-orbit relation scales as a power law with an exponent of 0.388 (the corresonding coefficient in our model). This result could be physical, observational, or both. The massive gas giant planets in our solar system have wider orbits than the small terrestrial planets. Massive planets are more detectable than less massive planets, and the two most common methods for finding exoplanets, radial velocity method and transit search, are more sensitive to planets in closer orbits.
Our model is useful by highlighting the significant difference in planet discoveries made from space and those made on the ground. Our model can also be used to estimate the detection limits of finding planets around faint stars and rapidly rotating stars. Ideally, we would like to discover an Earth-mass exoplanet orbiting its host star at a distance where liquid water could exist and support life. The planet mass-orbit relation derived from our analysis of confirmed exoplanets could help to tell us how close we are or how much improvement we must make in our planet searches in order to discovery another Earth.
The following code lists helper R functions that calculate or retrieve statistical values and report statistical decisions.
get_adj_r2 <- function(model) {
summary(model)$adj.r.squared
}
get_loocv_rmse <- function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
get_num_params <- function(model) {
length(coef(model))
}
get_bp_decision <- function(model, alpha) {
decide = unname(bptest(model)$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_sw_decision <- function(model, alpha) {
decide = unname(shapiro.test(resid(model))$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
The following code is the multi-nested loop used to generate strings that were parsed and evaluated by R to fit a collection models in the exhaustive model search.
lm_str_head <- "cur_mod=lm(log(pl_orbsmax)~"
lm_str_tail <- ",data=planets_good)"
lm_str <- rep("", n)
i <- 1
for (ipl_discmethod in c("1", "pl_discmethod")) {
for (ipl_disc in c("1", "pl_disc")) {
for (ipl_locale in c("1", "pl_locale")) {
for (ist_dist in c("1", "st_dist", "log(st_dist)")) {
for (ist_optmag in c("1", "st_optmag", "exp(st_optmag)")) {
for (ist_teff in c("1", "st_teff", "log(st_teff)")) {
for (ist_mass in c("1", "st_mass", "log(st_mass)")) {
for (ist_rad in c("1", "st_rad", "log(st_rad)")) {
for (ist_logg in c("1", "st_logg", "exp(st_logg)")) {
for (ist_metfe in c("1", "st_metfe", "exp(st_metfe)")) {
for (ist_vsini in c("1", "st_vsini")) {
for (ipl_orbeccen in c("1", "pl_orbeccen")) {
for (ipl_bmasse in c("1", "pl_bmasse", "log(pl_bmasse)")) {
lm_str[i] <- paste(ipl_discmethod,
ipl_disc,
ipl_locale,
ist_dist,
ist_optmag,
ist_teff,
ist_mass,
ist_rad,
ist_logg,
ist_metfe,
ist_vsini,
ipl_orbeccen,
ipl_bmasse,
sep = "+")
i <- i + 1
}
}
}
}
}
}
}
}
}
}
}
}
}